Chakrabartty, Wang & Chakrabarty/Reliability using Classical Definition

Abstract: As with all measurements, the measurement of examinee ability, in terms of the scores that the
examinee obtains in a test, is error-ridden. The quantification of such error or uncertainty in the test
score data, or rather of the complementary test reliability, is pursued within the paradigm of Classical Test
Theory in a variety of ways, with no existing method of finding reliability being isomorphic to the theoretical
definition that parametrises reliability as the ratio of the true score variance to the observed (i.e. error-ridden)
score variance. Thus, multiple reliability coefficients for the same test have been advanced. This paper
describes a much needed method of obtaining the reliability of a test as per its theoretical definition, via
a single administration of the test, by using a new fast method of splitting a given test into parallel
halves, achieving near-coincident empirical distributions of the two halves. The method has the desirable
property of achieving the splitting on the basis of the difficulty of the questions (or items) that constitute
the test, thus allowing for fast computation of reliability even for very large test data sets, i.e. test data
obtained from a very large examinee sample. An interval estimate for the true score is offered, given an
examinee score, subsequent to the determination of the test reliability. This method of finding test reliability
as per the classical definition can be extended to find the reliability of a set or battery of tests; a method for
the determination of the weights implemented in the computation of the weighted battery score is discussed.
We perform an empirical illustration of our method on real and simulated tests, and on a real test battery
comprising two constituent tests.

*Former Director of Indian Maritime University, Kolkata Campus, India
†Ph.D. student, Department of Mathematics, University of Leicester, U.K.
‡Lecturer of Statistics at Department of Mathematics, University of Leicester, and Associate Research Fellow at
Department of Statistics, University of Warwick

Keywords and phrases: Reliability, True score variance, Error variance, Battery of tests.

1. Introduction 

Examinee ability is measured by the scores obtained by an examinee in a test designed to assess such ability.
As with all other measurements, this ability measurement too is fundamentally uncertain or inclusive of errors.
In Classical Test Theory, the complementary certainty, or reliability of a test, is defined as
the ratio of the true score variance to the observed score variance, i.e. reliability is defined as the proportion
of observed test score variance that is attributable to the true score. Here, the observed score is treated as
inclusive of the measurement error. This theoretical definition notwithstanding, there are different methods of
obtaining reliability in practice, and problems arise from the implementation of these different techniques even
for the same test. Importantly, it is to be noted that the different methods of estimating reliability coefficients
deviate differently from the aforementioned classical or theoretical definition of reliability. This can potentially
result in different estimates of reliability of a particular test even for the same examinee sample. Berkowitz et al.
(2000) defined reliability as the degree to which test scores for a group of test takers are consistent over
repeated applications of the test, and are therefore inferred to be dependable and repeatable for an individual
test taker. In this framework, high reliability means that the test is trustworthy. Jacobs (1991) and Satterly
(1994) have opined that another way to express reliability is in terms of the standard error of measurement.
Rudner & Schafer (2002) mentioned that it is impossible to calculate a reliability coefficient that conforms
to the theoretical definition of the ratio of true score variance and observed score variance. Webb, Shavelson
& Haertel (2006) have suggested that the theoretical reliability coefficient is not practical since true scores of
individuals taking the test are not known. In fact, none of the existing methods of finding reliability is found
to be isomorphic to the classical definition.

In this paper we present a new methodology to compute the classically defined reliability coefficient of
a multiple-choice binary test, i.e. a test in which the response to each question is either correct or incorrect,
fetching a score of 1 or 0 respectively. Thus, our method gives the reliability as per the classical definition,
thereby avoiding confusion about different reliability values of a given test. The method that we advance does
not resort to multiple testing, i.e. administering the same test multiple times to a given examinee cohort. While
avoiding multiple testing, this method has the important additional benefit of identifying the way of splitting
of the test, in order to ensure that the split halves are equivalent (or parallel), so that they approach maximal
correlation and therefore maximal split-half reliability. We offer an interval estimate of the true score of any
individual examinee whose obtained score is known.

This paper is organised as follows. In the following section, we discuss the methods of finding reliability
as found in the current literature. In Section 2.1, we present a similar discussion, but this time in regard to a
set or battery of tests. Our method of finding the reliability as per the classical definition is expounded upon
in Section 3; this is based upon a novel method of splitting a test into two parallel halves using a new iterative
algorithm that is discussed in Section 4.2, along with the benefits of this iterative scheme. Subsequently,
we proceed to discuss the computation of the reliability of a battery of tests using summative scores and
weighted scores (Section 5.1). Empirical verification of the presented methodology is undertaken using four
sets of simulated data (in Section 6). Implementation of the method on real data is discussed in Section 7. We
round up the paper in Section 8, by recalling the salient outcomes of the work.

2. Currently available methods 

In the multiple testing (or test-retest) approach, the main concern is that the correlation between test
scores coming from distinct administrations depends on the time gap between the administrations, since the
sample may learn or forget something in between and also get acquainted with the test. Thus, different values
of reliability can be obtained depending on the time gap (Meadows & Billington 2005). Also, test-retest
reliability primarily reflects stability of scores and may vary depending on the homogeneity of groups of
examinees (Gualtieri, Thomas & Lynda 2006).

In the split-half method, the test is divided into two sub-tests, i.e. the whole set of questions is divided into
two sets of questions and the entire instrument is administered to one examinee sample. The split-half
reliability estimate is the correlation between scores of these two parallel halves, though researchers like Kaplan
& Saccuzzo (2001) recommend finding reliability of the entire test using the Spearman-Brown formula. The value
of the split-half reliability depends on how the test is dichotomised. We acknowledge that a test consisting
of $2n$ items can be split in half in $\binom{2n}{n}$ ways and that each method of splitting will
imply a distinct value of reliability in general. In this context, it merits mention that it is of crucial importance
to identify the way of splitting that ensures that the split halves are parallel and the split-half reliability is
maximum.
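As a minimal sketch of the split-half estimate described above, the following Python function correlates the scores of two halves and applies the standard Spearman-Brown step-up $2r/(1+r)$ to obtain a whole-test value. The function name and its arguments are illustrative assumptions and do not come from the paper; the halves are supplied by the caller, so the result depends on the chosen split.

```python
import numpy as np

def split_half_reliability(half_g, half_h):
    """Return (r_gh, r_full) for two vectors of half-test scores.

    r_gh   : Pearson correlation between the two halves.
    r_full : Spearman-Brown step-up, 2 * r_gh / (1 + r_gh).
    Names are hypothetical; the split itself is chosen by the caller.
    """
    half_g = np.asarray(half_g, dtype=float)
    half_h = np.asarray(half_h, dtype=float)
    r_gh = np.corrcoef(half_g, half_h)[0, 1]
    r_full = 2.0 * r_gh / (1.0 + r_gh)
    return r_gh, r_full
```

Two different splits of the same item set will in general give different values of `r_gh`, which is the dependence on dichotomisation noted in the text.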

In attempts to compute reliability based on “internal consistency”, the aim is to quantify how well the
questions in the test, or test “items”, reflect the same construct. These include the usage of Cronbach’s alpha
($\alpha$): the average of all possible split-half estimates of reliability, computed as a function of the ratio of the
total of the variances of the items to the variance of the examinee scores. Here, the “item score” is the total of the
scores obtained in that item by all the examinees who are taking the test, while an examinee score is the total
score of such an examinee, across all the items in the test. $\alpha$ can be shown to provide a lower bound for the
reliability under certain assumptions. Limitations of this method have been reported by Eisinga, Te Grotenhuis
& Pelzer (2012) and Ritter (2010). Panayides (2013) observed that high values of $\alpha$ do not necessarily mean
higher reliability and better quality of scales or tests. In fact, very high values of $\alpha$ could indicate lengthy
scales, a narrow coverage of the construct under consideration and/or parallel items (Boyle 1991; Kline 1979),
or redundancy of items (Streiner 2003).
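For concreteness, a short Python sketch of Cronbach's alpha in its usual textbook form, $\alpha = \frac{n}{n-1}\left(1 - \frac{\sum_j s_j^2}{s_{\mathrm{total}}^2}\right)$, is given here. It is an illustration under stated assumptions and is not code from the paper: the function name is hypothetical, the input is assumed to be an examinee-by-item score matrix, and the sample variance (divisor $N-1$) is used for both the item and the total-score variances.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (N examinees x n items) score matrix.

    Uses sample variances (ddof=1) for both the items and the total score;
    this choice is an assumption of the sketch.
    """
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_var_sum = scores.var(axis=0, ddof=1).sum()   # total of item variances
    total_var = scores.sum(axis=1).var(ddof=1)        # variance of examinee scores
    return n_items / (n_items - 1) * (1.0 - item_var_sum / total_var)
```

With two identical items the function returns 1, which illustrates the remark above that very high values of alpha can simply reflect redundant or parallel items.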

In another approach, referred to as the “parallel forms” approach, the test is divided into two forms or
sub-tests by randomly selecting the test items to comprise either sub-test, under the assumption that the randomly
divided halves are parallel or equivalent. Such a demand of parallelity requires generating lots of items that
reflect the same construct, which is difficult to achieve in practice. This approach is very similar to the
split-half reliability described above, except that the parallel forms constructed herein can be used independently
of each other and are considered equivalent. In practice, this assumption of strict parallelity is too restrictive.

Each of the above methods of estimation of reliability has certain advantages and disadvantages. However,
the estimation of the reliability coefficient under each such method deviates differently from the theoretical
definition and consequently gives different values for the reliability of a single test. In general, test-retest
reliability with a significant time gap is lower in value than the parallel forms reliability and the reliability
computed under the model of internal consistency amongst the test items. To summarise, there is a potential for
confusion over the trustworthiness of a test, emanating from the inconsistencies amongst the different available
methods that are implemented to compute reliability coefficients. This state of affairs motivates a need to estimate
reliability in an easily reproducible way, from a single administration of the test, using the theoretical definition
of reliability, i.e. as a ratio of true score variance and observed score variance.

2.1. Battery reliability 

Bock (1966) derived the reliability of a battery using Multivariate Analysis of Variance (MANOVA), where $r$
parallel forms of a test battery were given to each individual in the examinee sample. It can be
proved that the average canonical reliability from MANOVA coincides with the canonical reliability for the
Mahalanobis distance function. Wood & Safrit (1987) compared three multivariate models (canonical
reliability model, maximum generalisability model, canonical correlation model) for estimating the reliability of
a test battery and observed that the maximum generalisability model showed the least degree of bias, the smallest
errors in estimation, and the greatest relative efficiency across all experimental conditions. Ogasawara (2009)
considered four estimators of the reliability of a composite score based on a factor analysis approach, and five
estimators of the maximal reliability for the composite scores, and found that the same estimators are obtained
when Wishart Maximum Likelihood estimators are used. Conger & Lipshitz (1973) considered canonical
reliability as the ratio of the average squared distance among the true scores to the average squared distance
among the observed scores. They thereby showed that the canonical reliability is consistent with multivariate
analogues of parallel forms of correlation, squared correlation between true and observed scores and an
analysis of variance formulation. They opined that the factors corresponding to the small eigenvalues might
show significant differences and should not be discarded. Equivalences amongst canonical factor analysis,
canonical reliability and principal component analysis were studied by Conger & Stallard (1976); they found
that if variables are scaled so that they have equal measurement errors, then canonical factor analysis on all
non-error variance, principal component analysis and canonical reliability analysis give rise to equivalent
results.

In this paper, the objective is to present a methodology for obtaining reliability as per its theoretical
definition from a single administration of a test, and to extend this methodology to find ways of obtaining the
reliability of a battery of tests.


3. Methodology 


Consider a test consisting of $n$ items, administered to $N$ persons. The test score of the $i$-th examinee is
$X_i$, $i = 1, \ldots, N$, and $X = (X_1, X_2, \ldots, X_N)^T$ is referred to as the test score vector. The vector
$I = (I_1, I_2, \ldots, I_N)^T$ depicts the maximum possible scores for the test, where $I_i = n\ \forall\ i = 1, 2, \ldots, N$.
Here $X, I \in \mathcal{X} \subset \mathbb{R}^N$, where we refer to $\mathcal{X}$ as the $N$-dimensional person space.
Sorting the components of $X$ in increasing order will produce a merit list for the sample of examinees, in terms of
the observed scores, and thereby help infer the ability distribution of the sample of examinees. However, parameters
of the test can be obtained primarily in terms of the 2-norm of the vectors $X$ and $I$, i.e. $\|X\|$, $\|I\|$, and the
angle $\theta_X$ between these two vectors.


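The mean of the test scores, $\bar{X}$, is given as follows. Since $\cos\theta_X = \dfrac{\sum_{i=1}^{N} X_i I_i}{\|X\|\,\|I\|}$,
we get $\bar{X} = \dfrac{\|X\|\cos\theta_X}{\sqrt{N}}$. Similarly, the mean of the true score vector $T$ is
$\bar{T} = \dfrac{\|T\|\cos\theta_T}{\sqrt{N}}$, where $\theta_T$ is the angle between the true score vector and the
ideal score vector $I$.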


Test variance can also be obtained in this framework. For $N$ persons who take the test,
$S_X^2 = \dfrac{1}{N}\|X\|^2 - \bar{X}^2$, and $S_X = \dfrac{\|x\|}{\sqrt{N}}$, where $x_i = X_i - \bar{X}$ is the $i$-th
deviation score, $i = 1, 2, \ldots, N$. The relationship between the norm of the deviation score vector and the norm of
the test score vector can be easily derived as $\|x\|^2 = \|X\|^2\sin^2\theta_X$. Similarly,
$\|t\|^2 = \|T\|^2\sin^2\theta_T$, where $t_i = T_i - \bar{T}$.

Using this relationship between deviation and test scores, test variance can be rephrased as

$$S_X^2 = \frac{\|X\|^2\sin^2\theta_X}{N}. \qquad \text{Also } S_T^2 = \frac{\|T\|^2\sin^2\theta_T}{N}. \tag{3.1}$$

The aforementioned relationship between test and deviation scores gives
$\sin^2\theta_X = \dfrac{\|x\|^2}{\|X\|^2}$ and $\sin^2\theta_T = \dfrac{\|t\|^2}{\|T\|^2}$.

At the same time, in Classical Test Theory, the means of the observed and true scores coincide, i.e. $\bar{X} = \bar{T}$,
as in this framework the observed score is represented as the sum of the true score (obtained under the paradigm
of no errors) and the error, where the error is assumed to be normally distributed with zero mean and a
variance referred to as the error variance. Then recalling the definition of these sample means, we get

$$\cos\theta_T = \frac{\|X\|}{\|T\|}\cos\theta_X. \tag{3.2}$$

This gives the relationship amongst $\|X\|$, $\|T\|$, $\theta_T$ and $\theta_X$. The stage is now set for us to develop the
methodology for computing reliability using the classical definition.
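The geometric expressions of this section can be checked numerically. The Python sketch below, with a hypothetical function name and a made-up score vector in the usage note, computes the mean and the variance of a score vector purely from its norm and its angle with the ideal score vector $I = (n, \ldots, n)^T$, using $\bar{X} = \|X\|\cos\theta_X/\sqrt{N}$ and $S_X^2 = \|X\|^2\sin^2\theta_X/N$. It assumes that at least one score is non-zero, so that $\|X\| > 0$.

```python
import numpy as np

def geometric_summary(X, n_items):
    """Mean and (population) variance of the scores X from norms and angles.

    X        : observed scores of the N examinees (not all zero).
    n_items  : number of items n, so the ideal score vector is I = (n, ..., n).
    """
    X = np.asarray(X, dtype=float)
    N = X.size
    I = np.full(N, float(n_items))
    norm_X = np.linalg.norm(X)
    cos_theta = X @ I / (norm_X * np.linalg.norm(I))
    mean = norm_X * cos_theta / np.sqrt(N)            # ||X|| cos(theta_X) / sqrt(N)
    var = norm_X ** 2 * (1.0 - cos_theta ** 2) / N    # ||X||^2 sin^2(theta_X) / N
    return mean, var
```

For example, `geometric_summary([3, 5, 7, 9], 10)` agrees with `np.mean` and `np.var` (divisor $N$) applied to the same scores.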


4. Our method 
4.1. Background 

Reliability, $r_{tt}$, of a test is defined classically as $r_{tt} = \dfrac{S_T^2}{S_X^2}$. Thus, to know $r_{tt}$ one needs
to know the value of the true score variance or the error variance. For this, let us concentrate on parallel tests.
As per the classical definition, two tests “$g$” and “$h$” are parallel if

$$T_i^{(g)} = T_i^{(h)},\ i = 1, 2, \ldots, N, \quad \text{and} \quad S_e^{(g)} = S_e^{(h)}, \tag{4.1}$$

where the superscript “$g$” refers to sub-test $g$ and the superscript “$h$” to sub-test $h$, and $S_e^{(p)}$ is the standard
deviation of the error scores in the $p$-th sub-test, $p = g, h$. Thus, $T_i^{(g)}$ is the true score of the $i$-th examinee
in the $g$-th test and $T_i^{(h)}$ that in the $h$-th test. Similarly, the observed scores of the $i$-th examinee in the two
tests are $X_i^{(g)}$ and $X_i^{(h)}$.

From here on, we interpret “$g$” and “$h$” as the two parallel sub-tests that a given test is dichotomised into.
This implies that the observed score vectors of these two parallel tests can be represented by two points $X_g$
and $X_h$ in $\mathcal{X}$, such that $X_g = T_g + E_g$ and $X_h = T_h + E_h$ in the paradigm of Classical Test Theory, where
$E_p$ is the error score vector in the $p$-th test, $p = g, h$. (Here $X_p = (X_1^{(p)}, \ldots, X_N^{(p)})^T$, $p = g, h$.) Now,
recalling that for parallel tests $g$ and $h$, $T_i^{(g)} = T_i^{(h)}\ \forall\ i = 1, \ldots, N \implies T_g = T_h \implies X_g - X_h = E_g - E_h$, so that

$$\|X_g\|^2 + \|X_h\|^2 - 2\|X_g\|\,\|X_h\|\cos\theta_{gh} = \|E_g\|^2 + \|E_h\|^2 - 2\|E_g\|\,\|E_h\|\cos\theta_{gh}^{(E)}, \tag{4.2}$$

where $\theta_{gh}$ is the angle between $X_g$ and $X_h$, while $\theta_{gh}^{(E)}$ is the angle between $E_g$ and $E_h$. But the correlation
of the errors in two parallel tests is zero (this follows from the equality of the standard deviations of the error
scores in the parallel sub-tests and the equality of the means of the error scores; see Equation 3.2). The geometrical
interpretation of this is that the error vectors of the two parallel tests are orthogonal, i.e. $\cos\theta_{gh}^{(E)} = 0$.
Then Equation 4.2 can be written as


$$\|X_g\|^2 + \|X_h\|^2 - 2\|X_g\|\,\|X_h\|\cos\theta_{gh} = \|E_g\|^2 + \|E_h\|^2 = N\left(S_e^{(g)}\right)^2 + N\left(S_e^{(h)}\right)^2 = 2N\left(S_e^{(g)}\right)^2, \tag{4.3}$$

where we have used the equality of error variances of parallel tests in the last step. Equations 4.3 can be employed
to find the value of $S_e^{(g)}$ from the data (on observed scores). In other words, we can use the available test score
data in this equation to obtain the error variance of either parallel sub-test that a test can be dichotomised
into. Alternatively, if two parallel tests exist, then Equations 4.3 can be invoked to compute the error variance
in either of the two parallel tests.


Now, Equations 4.3 suggest that the error variance of the entire test is

$$\left(S_e^{(test)}\right)^2 = 2\left(S_e^{(g)}\right)^2 = \frac{\|X_g\|^2 + \|X_h\|^2 - 2\|X_g\|\,\|X_h\|\cos\theta_{gh}}{N}. \tag{4.4}$$


Then recalling that in the paradigm of Classical Test Theory the true score is by definition independent of
the error, it follows that the observed score variance $S_X^2$ is the sum of the true score variance $S_T^2$ and the error
variance $\left(S_e^{(test)}\right)^2$, so that the classical definition of reliability gives

$$r_{tt} = \frac{S_T^2}{S_X^2} = 1 - \frac{\left(S_e^{(test)}\right)^2}{S_X^2} = 1 - \frac{\|X_g\|^2 + \|X_h\|^2 - 2\|X_g\|\,\|X_h\|\cos\theta_{gh}}{N S_X^2}. \tag{4.5}$$


As $\|X_g\|\,\|X_h\|\cos\theta_{gh} = \sum_{i=1}^{N} X_i^{(g)} X_i^{(h)}$, Equation 4.5 can be simplified to give

$$r_{tt} = 1 - \frac{2\|X_g\|^2 - 2\sum_{i=1}^{N} X_i^{(g)} X_i^{(h)}}{N S_X^2}, \tag{4.6}$$

since $\|X_g\| = \|X_h\|$ for parallel tests, given that $\bar{X}_g = \bar{X}_h$ and $S_X^{(g)} = S_X^{(h)}$. Equation 4.5 and Equation 4.6
give a unique way of finding the reliability of a test from a single administration, using the classical definition of
reliability, as long as dichotomisation of the test is performed into two parallel sub-tests. Importantly, in this
framework, the hypothesis of equality of error variances of the two halves of a given test can be tested using
an F-test. In addition, the method also provides a way to estimate true scores from the data. We discuss this
later in this section.
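A minimal Python sketch of the reliability computation of this section follows, assuming that the two half-test score vectors are already available from some dichotomisation. It uses the form $r_{tt} = 1 - \|X_g - X_h\|^2/(N S_X^2)$, which is Equation 4.5 with the numerator $\|X_g\|^2 + \|X_h\|^2 - 2\sum_i X_i^{(g)}X_i^{(h)}$ written as a squared distance. The function name is hypothetical, and $S_X^2$ is taken to be the variance (divisor $N$) of the whole-test score $X_g + X_h$.

```python
import numpy as np

def classical_reliability(X_g, X_h):
    """Reliability 1 - (error variance) / (observed variance) from two halves.

    X_g, X_h : examinee scores in the two (near-)parallel sub-tests.
    The observed variance must be non-zero.
    """
    X_g = np.asarray(X_g, dtype=float)
    X_h = np.asarray(X_h, dtype=float)
    N = X_g.size
    # Error variance of the whole test: ||X_g - X_h||^2 / N.
    error_var = np.sum((X_g - X_h) ** 2) / N
    # Observed variance of the whole-test score X = X_g + X_h (divisor N).
    observed_var = np.var(X_g + X_h)
    return 1.0 - error_var / observed_var
```

The value is only meaningful as a reliability when the halves are close to parallel; for an arbitrary split the same formula can return a much lower number.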
4.2. Proposed method of split-half

Chakrabartty (2011) gave a method for splitting a test into 2 parallel halves. Here we give a novel method
of splitting a test into 2 parallel halves, $g$ and $h$, that have nearly equal means and variances of the observed
scores. The splitting is initiated by the determination of the item-wise total score for each item. So let the $j$-th
item score be $\tau_j = \sum_{i=1}^{N} X_i^{(j)}$, where $X_i^{(j)}$ is the $i$-th examinee’s score in the $j$-th
item, $j = 1, \ldots, n$. Our method of splitting is as follows.

Step-I The item-wise scores are sorted in descending order, resulting in the ordered sequence $\tau_1, \tau_2, \ldots, \tau_n$
        (so that $\tau_1$ is the highest item score, as the table below requires). Following this, the item with the
        highest total score is identified and allocated to the $g$-th sub-test. The item with the second highest total
        score is then allocated to the $h$-th sub-test, while the item with the third highest score is assigned to the
        $h$-th sub-test and the fourth highest to the $g$-th sub-test, and so on. In other words, allocation of items is
        performed to ensure realisation of the following structure.

        sub-test g    sub-test h    difference in item-wise scores of the 2 sub-tests
        $\tau_1$      $\tau_2$      $\tau_1 - \tau_2 \geq 0$
        $\tau_4$      $\tau_3$      $\tau_4 - \tau_3 \leq 0$
        ...           ...           ...

        where we assume $n$ to be even; for tests with an odd number of items, we ignore the last item for the
        purposes of dichotomisation. The sub-tests obtained after the very first slotting of the sequence
        $\{\tau_k\}_{k=1}^{n}$ into the sub-tests, following this suggested method of distribution, are referred to as
        the “seed sub-tests”.
Step-II Next, the difference of item-wise scores in every row of the $g$-th and $h$-th sub-tests is recorded and the
        sum $S$ of these differences is computed (total of column 3 in the above table). If the value of $S$ is zero,
        we terminate the process of distribution of items across the 2 sub-tests; otherwise we proceed to the
        next step.
Step-III We identify rows in the above table, the swapping of the entries of columns 1 and 2 of which results
        in the reduction of $|S|$, where $|\cdot|$ denotes absolute value. Let the row numbers of such rows be
        $\rho_\ell$, $\ell = 1, 2, \ldots, n'$, where $n' \leq n/2$. We swap the $\rho_\ell$-th item of the $g$-th sub-test with the
        $\rho_\ell$-th item of the $h$-th sub-test and re-calculate the sum of the entries of the revised $g$-th sub-test and
        $h$-th sub-test. If the revised value of $|S|$ is zero, or a number close to zero that does not reduce upon further
        iterations, we stop the iteration; otherwise we return to the identification of the row numbers and
        proceed therefrom again.
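The three steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' code, and it makes three labelled assumptions: the function name and return values are hypothetical; for an odd number of items, the item with the lowest total score is the one ignored; and in Step-III a single row is swapped per iteration, namely the row whose swap gives the smallest resulting $|S|$, since the text does not say whether the qualifying rows are swapped one at a time or together.

```python
import numpy as np

def split_parallel_halves(scores):
    """Split items into sub-tests g and h using item-wise total scores.

    scores : (N examinees x n items) array of 0/1 item scores.
    Returns (g_items, h_items, S): item indices of the two sub-tests and the
    final signed difference S between their summed item totals.
    """
    scores = np.asarray(scores, dtype=np.int64)
    tau = scores.sum(axis=0)                      # item-wise total scores
    order = np.argsort(-tau, kind="stable")       # highest total first
    if order.size % 2:                            # assumption: drop the lowest-ranked item
        order = order[:-1]
    pairs = order.reshape(-1, 2)                  # one table row per consecutive pair
    # Step-I (seed sub-tests): g gets the higher item in rows 0, 2, ...
    # and the lower item in rows 1, 3, ...
    g = pairs[:, 0].copy()
    h = pairs[:, 1].copy()
    g[1::2], h[1::2] = pairs[1::2, 1], pairs[1::2, 0]
    while True:
        d = tau[g] - tau[h]                       # Step-II: row-wise differences
        S = d.sum()
        if S == 0:
            break
        new_abs = np.abs(S - 2 * d)               # |S| after swapping each row alone
        r = int(np.argmin(new_abs))
        if new_abs[r] >= abs(S):                  # Step-III: no swap reduces |S|
            break
        g[r], h[r] = h[r], g[r]                   # swap the entries of row r
    return g, h, int((tau[g] - tau[h]).sum())
```

Because $|S|$ is a non-negative integer that strictly decreases with every accepted swap, the loop terminates. The examinee-level data enter only through the item totals, so the cost of the search depends on the number of items and not on $N$.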

We considered other methods of splitting the test as well, including methods in which swapping of items
of the two sub-tests is allowed not just within a given row but also across rows, and in which minimisation of the
sum of the differences between the items in the $g$-th and $h$-th sub-tests is not the only criterion. To be precise, we
conducted methods of dichotomisation in which, in Step-I, we compute the sum $S$ of the differences between
the item scores in the 2 sub-tests, and also consider the sum $S_{sq}$ of the differences of the squares of the item scores
in the $g$-th and $h$-th sub-tests. We then identify row numbers $\rho$ and $\rho'$ in the above table, such that
swapping the item in row $\rho$ of the $g$-th sub-test with the entry in row $\rho'$ of the $h$-th sub-test results in
a reduction of $|S| \times |S_{sq}|$; $\rho, \rho' \leq n/2$. When this product can no longer be reduced over a chosen
number of iterations, the scheme is stopped; otherwise the search for the parallel halves that result in the
minimum value of this product is maintained. However, parallelisation obtained with this method of splitting,
which seeks to minimise $|S| \times |S_{sq}|$, was empirically found to yield similar results to the method that seeks
to minimise $|S|$. Here, by “similar results” is implied dichotomisation of the given test into two sub-tests, the
means and variances of which are equally close to each other in both methods, i.e. the sub-tests are approximately
equally parallel. The reason for such empirically noted similarity is discussed in the following section. Given
this, we advance the method enumerated above in Steps-I to III as our chosen method of dichotomisation.

By the mean (or variance) of a sub-test is implied the mean (or variance) of the vector of examinee scores in
the $n/2$ items that comprise that sub-test, i.e. the mean (or variance) of the score vector of the sub-test. Thus,
if the item numbers $g_1, \ldots, g_{n/2}$ comprise the $g$-th sub-test, $g_j \in \{1, 2, \ldots, n\}$, $j = 1, \ldots, n/2$, then the
$N$-dimensional score vector of the $g$-th sub-test is $X^{(g)} := (X_1^{(g)}, \ldots, X_N^{(g)})^T$, where
$X_i^{(g)} = \sum_{j=1}^{n/2} X_i^{(g_j)}$, i.e. the score achieved by the $i$-th examinee across all the items that are
included in the $g$-th sub-test; this is to be distinguished from $\tau_j$, the score in the $j$-th item, summed over all
examinees. Given that $X_i^{(j)}$ is either 0 or 1, we get that $X_i^{(g)} \leq n/2$ and $\tau_j \leq N$, $\forall\ i, j$. We similarly
define the score vector $X^{(h)}$ of the $h$-th sub-test.

Also, in the following section we identify the score in the $j$-th item that is a constituent of the $g$-th sub-test as
$\tau_j^{(g)}$, $j = 1, \ldots, n/2$. Similarly we define $\tau_j^{(h)}$.

Remark 4.1. The order of our algorithm is independent of the examinee number and driven by the number
of items in each sub-test that the test of $n$ items is dichotomised into; the order is $O((n/2)^2)$.

4.3. Benefits of our method of parallelisation 

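Theorem 4.1. Difference between means of the $g$-th and $h$-th sub-tests is minimised.

Proof. Let the score vectors of the $g$-th and $h$-th sub-tests be $X^{(g)} := (X_1^{(g)}, \ldots, X_N^{(g)})^T$ and
$X^{(h)} := (X_1^{(h)}, \ldots, X_N^{(h)})^T$, where $X_i^{(p)} = \sum_{j=1}^{n/2} X_i^{(p_j)}$, $p = g, h$. Now, the mean of the $g$-th
sub-test is $\bar{X}_g = \dfrac{\sum_{i=1}^{N} X_i^{(g)}}{N}$ and the mean of the $h$-th sub-test is
$\bar{X}_h = \dfrac{\sum_{i=1}^{N} X_i^{(h)}}{N}$.

Let the item score of the $j$-th item in the $g$-th sub-test be $\tau_j^{(g)}$, $j = 1, \ldots, n/2$. Similarly, let the score
of the $j$-th item in the $h$-th sub-test be $\tau_j^{(h)}$.

But the sum of item scores in all $n/2$ items in a sub-test is equal to the sum of examinee scores achieved in these
$n/2$ items, i.e.

$$\sum_{j=1}^{n/2} \tau_j^{(g)} = \sum_{i=1}^{N} X_i^{(g)}, \qquad \sum_{j=1}^{n/2} \tau_j^{(h)} = \sum_{i=1}^{N} X_i^{(h)}. \tag{4.7}$$

Now, by definition,

$$|S| = \left| \left[\tau_1^{(g)} + \ldots + \tau_{n/2}^{(g)}\right] - \left[\tau_1^{(h)} + \ldots + \tau_{n/2}^{(h)}\right] \right|. \tag{4.8}$$

At the end of the splitting of the test, let $|S| = \epsilon$. Then $\epsilon$ is the minimum value of $|S|$ by our method.
Then using $|S| = \epsilon$ together with Equations 4.7 and 4.8, we get

$$\left| \sum_{j=1}^{n/2} \tau_j^{(g)} - \sum_{j=1}^{n/2} \tau_j^{(h)} \right| = \epsilon \implies \left| \sum_{i=1}^{N} X_i^{(g)} - \sum_{i=1}^{N} X_i^{(h)} \right| = \epsilon. \tag{4.9}$$

Then

$$\left| \bar{X}_g - \bar{X}_h \right| = \left| \frac{\sum_{i=1}^{N} X_i^{(g)}}{N} - \frac{\sum_{i=1}^{N} X_i^{(h)}}{N} \right| = \frac{\epsilon}{N}$$

(using Equation 4.9), i.e. the difference between means is minimised by our method.

□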
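The identity behind Theorem 4.1, namely that the gap between the two sub-test means equals the absolute difference of the summed item totals divided by $N$, can be checked with a few lines of Python. The function name and arguments are hypothetical, and the split passed in may be any pair of disjoint item index lists.

```python
import numpy as np

def half_mean_gap(scores, g_items, h_items):
    """Return (|mean_g - mean_h|, epsilon / N) for a given split.

    scores  : (N examinees x n items) array of item scores.
    g_items : item indices forming sub-test g.
    h_items : item indices forming sub-test h.
    """
    scores = np.asarray(scores, dtype=float)
    N = scores.shape[0]
    X_g = scores[:, g_items].sum(axis=1)          # examinee scores in sub-test g
    X_h = scores[:, h_items].sum(axis=1)          # examinee scores in sub-test h
    # epsilon: absolute difference between the summed item totals of g and h.
    epsilon = abs(scores[:, g_items].sum() - scores[:, h_items].sum())
    return abs(X_g.mean() - X_h.mean()), epsilon / N
```

The two returned numbers coincide up to floating-point rounding for every split, so reducing $|S|$ is the same as reducing the difference of the sub-test means.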


Remark 4.2. In our work, by “near-equal means”, we imply means of the sub-tests, the difference between
which is minimised using our method; typically, this minimum value is close to 0. Thus, the means of the two
sub-tests are near-equal.































Theorem 4.2. The absolute difference between the sums of squares of examinee scores in the $g$-th and $h$-th sub-tests
is of the order of $\epsilon^2$, if the absolute difference between the sums of examinee scores is $\epsilon$.

Proof. The absolute difference between the sums of scores of the $g$-th and $h$-th sub-tests is minimised in our method
(Equation 4.9), with

$$\left| \sum_{i=1}^{N} X_i^{(g)} - \sum_{i=1}^{N} X_i^{(h)} \right| = \epsilon.$$

For our method, $\epsilon$ is small. Thus we state:

$$\sum_{i=1}^{N} X_i^{(g)} \approx \sum_{i=1}^{N} X_i^{(h)}, \tag{4.10}$$

so that $T := \sum_{i=1}^{N} X_i^{(g)} \implies \sum_{i=1}^{N} X_i^{(h)} = T \pm \epsilon$.

Here we define

$$X_i^{(g)} = \sum_{j=1}^{n/2} X_i^{(g_j)},$$

and similarly, we define $X_i^{(h)}$, $\forall\ i = 1, \ldots, N$.
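Now,

$$\begin{aligned}
\left( \sum_{i=1}^{N} X_i^{(g)} \right)^2 &= \sum_{i=1}^{N} \left( X_i^{(g)} \right)^2 + X_1^{(g)}\left[ X_2^{(g)} + \ldots + X_N^{(g)} \right] + \ldots + X_N^{(g)}\left[ X_1^{(g)} + \ldots + X_{N-1}^{(g)} \right] \\
&= \sum_{i=1}^{N} \left( X_i^{(g)} \right)^2 + \sum_{i=1}^{N} \sum_{k=1; k \neq i}^{N} X_i^{(g)} X_k^{(g)}.
\end{aligned} \tag{4.11}$$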
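As a quick numerical check of the expansion in Equation 4.11, the Python sketch below computes the cross-product sum $\sum_i \sum_{k \neq i} X_i X_k$ directly, so that it can be compared with $(\sum_i X_i)^2 - \sum_i X_i^2$. The function name is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

def cross_product_sum(x):
    """Sum of x_i * x_k over all ordered pairs with i != k."""
    x = np.asarray(x, dtype=float)
    outer = np.outer(x, x)                 # all products x_i * x_k
    return outer.sum() - np.trace(outer)   # remove the i == k terms
```

For the scores (1, 2, 3) the square of the sum is 36, the sum of squares is 14, and the cross-product sum accounts for the remaining 22.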
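Now, for $k \neq i$,

$$\begin{aligned}
X_i^{(g)} X_k^{(g)} &= \left( X_i^{(g_1)} + \ldots + X_i^{(g_{n/2})} \right)\left( X_k^{(g_1)} + \ldots + X_k^{(g_{n/2})} \right) \\
&= X_i^{(g_1)} X_k^{(g_1)} + X_i^{(g_1)} X_k^{(g_2)} + \ldots + X_i^{(g_1)} X_k^{(g_{n/2})} + \ldots \\
&\quad + X_i^{(g_{n/2})} X_k^{(g_1)} + X_i^{(g_{n/2})} X_k^{(g_2)} + \ldots + X_i^{(g_{n/2})} X_k^{(g_{n/2})} \\
&= \sum_{j=1}^{n/2} \sum_{j'=1}^{n/2} \left[ \text{number of times each of } X_i^{(g_j)} \text{ and } X_k^{(g_{j'})} \text{ is 1} \right] \\
&= \left[ \sum_{j=1}^{n/2} \left( \text{number of times } X_i^{(g_j)} = 1 \right) \right] \left[ \sum_{j'=1}^{n/2} \left( \text{number of times } X_k^{(g_{j'})} = 1 \right) \right] \\
&\approx (n/2)^2 \left[ \sum_{j=1}^{n/2} \Pr\left( X_i^{(g_j)} = 1 \right) \right] \left[ \sum_{j'=1}^{n/2} \Pr\left( X_k^{(g_{j'})} = 1 \right) \right].
\end{aligned} \tag{4.12}$$

Here $X_i^{(g_j)}$ is the score of the $i$-th examinee in the $j$-th item of the $g$-th sub-test and can attain values of either
1 or 0, with probability $p_i^{(g_j)}$ or $1 - p_i^{(g_j)}$ respectively. Thus, $X_i^{(g_j)} \sim \text{Bernoulli}(p_i^{(g_j)})$, i.e.
$\Pr(X_i^{(g_j)} = 1) = p_i^{(g_j)}$. The approximation in Equation 4.12 stems from approximating the probability for an
event with its relative frequency. Then following Equation 4.12, we get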

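$$\begin{aligned}
\sum_{i=1}^{N} \sum_{k=1; k \neq i}^{N} X_i^{(g)} X_k^{(g)} &\approx (n/2)^2 \sum_{i=1}^{N} \sum_{k=1; k \neq i}^{N} \left[ \sum_{j=1}^{n/2} p_i^{(g_j)} \right] \left[ \sum_{j'=1}^{n/2} p_k^{(g_{j'})} \right] \\
&= (n/2)^2 \sum_{i=1}^{N} \left[ \sum_{j'=1}^{n/2} p_i^{(g_{j'})} \right] \left[ \sum_{k=1; k \neq i}^{N} \sum_{j=1}^{n/2} p_k^{(g_j)} \right].
\end{aligned} \tag{4.13}$$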


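Now, Equation 4.10 implies that

$$\begin{aligned}
\sum_{i=1}^{N} \left[ \sum_{j=1}^{n/2} X_i^{(g_j)} \right] &\approx \sum_{i=1}^{N} \left[ \sum_{j=1}^{n/2} X_i^{(h_j)} \right], \text{ i.e.} \\
\sum_{i=1}^{N} \left[ \sum_{j=1}^{n/2} \left( \text{number of times } X_i^{(g_j)} = 1 \right) \right] &\approx \sum_{i=1}^{N} \left[ \sum_{j=1}^{n/2} \left( \text{number of times } X_i^{(h_j)} = 1 \right) \right], \text{ i.e.} \\
\sum_{i=1}^{N} \left[ \sum_{j=1}^{n/2} p_i^{(g_j)} \right] &\approx \sum_{i=1}^{N} \left[ \sum_{j=1}^{n/2} p_i^{(h_j)} \right].
\end{aligned} \tag{4.14}$$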


Then if we delete any 1 out of the $N$ examinees over which the outer summation on the RHS and LHS of
the last approximate equality is carried out, it is expected that the approximation expressed in statement 4.14
would still be valid. This is especially the case if $N$ is large. In other words, the bigger the $N$, the smaller is the
distortion effected on the structure of the sub-tests generated by splitting the test data obtained after deleting
the score of any 1 of the $N$ examinees from the original full test data. Then using statement 4.14 for a large
$N$, we can write

$$\sum_{k=1, k \neq i}^{N} \left[ \sum_{j=1}^{n/2} p_k^{(g_j)} \right] \approx \sum_{k=1, k \neq i}^{N} \left[ \sum_{j=1}^{n/2} p_k^{(h_j)} \right], \tag{4.15}$$

where

$$\sum_{k=1, k \neq i}^{N} \left[ \sum_{j=1}^{n/2} p_k^{(g_j)} \right] = \sum_{k=1, k \neq i}^{N} \left[ \sum_{j=1}^{n/2} p_k^{(h_j)} \right] \pm \epsilon'$$

follows from Equation 4.10, which suggests that $\sum_{i=1}^{N} X_i^{(g)} = \sum_{i=1}^{N} X_i^{(h)} \pm \epsilon$. Thus
$\epsilon \in \mathbb{Z}_{\geq 0}$ and $\epsilon' \in \mathbb{R}_{\geq 0}$ are such that $\epsilon > \epsilon'$, given that $\epsilon'$ is the absolute
difference between the sums of the probabilities of correct response in the $g$-th sub-test and that in the $h$-th
sub-test, while $\epsilon$ is the absolute difference between the sums of scores in the two sub-tests.


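In other words, if we define the sum of probabilities of correct response in the $g$-th sub-test, $T'$, as

$$T' := \sum_{k=1, k \neq i}^{N} \left[ \sum_{j=1}^{n/2} p_k^{(g_j)} \right],$$

then

$$\sum_{k=1, k \neq i}^{N} \left[ \sum_{j=1}^{n/2} p_k^{(h_j)} \right] = T' \pm \epsilon'. \tag{4.16}$$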


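Using this, for sub-test $g$, in the last line of Equations 4.13 we get

$$\sum_{i=1}^{N} \sum_{k=1; k \neq i}^{N} X_i^{(g)} X_k^{(g)} \approx (n/2)^2\, T' \sum_{i=1}^{N} \left[ \sum_{j=1}^{n/2} p_i^{(g_j)} \right]. \tag{4.17}$$

For sub-test $h$,

$$\sum_{i=1}^{N} \sum_{k=1; k \neq i}^{N} X_i^{(h)} X_k^{(h)} \approx (n/2)^2\, T' \sum_{i=1}^{N} \left[ \sum_{j=1}^{n/2} p_i^{(h_j)} \right],$$

where the approximation in the above statement is of the order as in statement 4.17, enhanced by
$\mp (n/2)^2 T' \epsilon'$, following statement 4.16. Then

$$\sum_{i=1}^{N} \sum_{k=1; k \neq i}^{N} X_i^{(g)} X_k^{(g)} \approx \sum_{i=1}^{N} \sum_{k=1; k \neq i}^{N} X_i^{(h)} X_k^{(h)}, \tag{4.18}$$

where the approximation is of the order of $(n/2)^2 T' \epsilon'$.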
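Since $\left( \sum_{i=1}^{N} X_i^{(g)} \right)^2 - \sum_{i=1}^{N} \left( X_i^{(g)} \right)^2 = \sum_{i=1}^{N} \sum_{k=1; k \neq i}^{N} X_i^{(g)} X_k^{(g)}$, as in the
last line of Equations 4.11, statement 4.18 tells us that for the $g$-th and the $h$-th sub-tests,

$$\left( \sum_{i=1}^{N} X_i^{(g)} \right)^2 - \sum_{i=1}^{N} \left( X_i^{(g)} \right)^2 \approx \left( \sum_{i=1}^{N} X_i^{(h)} \right)^2 - \sum_{i=1}^{N} \left( X_i^{(h)} \right)^2, \tag{4.19}$$

where, as for statement 4.18, the approximation is of the order of $(n/2)^2 T' \epsilon'$. But by squaring both sides of
Equation 4.10 we get $\left( \sum_{i=1}^{N} X_i^{(g)} \right)^2 \approx \left( \sum_{i=1}^{N} X_i^{(h)} \right)^2$, where the approximation is of
the order of $\mp 2T\epsilon$. Then in statement 4.19 we get

$$\sum_{i=1}^{N} \left( X_i^{(g)} \right)^2 \approx \sum_{i=1}^{N} \left( X_i^{(h)} \right)^2, \tag{4.20}$$

the approximation in which is of the order of $\mp 2T\epsilon \pm (n/2)^2 T' \epsilon'$, i.e. of the order of $\epsilon^2$.

□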


Remark 4.3. It is to be noted in the proof of Theorem 4.2 that even if the sums of scores of the two sub-tests
are equal as per our method of splitting, i.e. even if $\epsilon$ is achieved to be 0, the absolute difference between the sums
of squares of scores in the two sub-tests is not necessarily 0, since $T$ and $(n/2)^2 T'$ are not necessarily equal.


Now, the variances of sub-tests $g$ and $h$ are, respectively,
$\dfrac{\sum_{i=1}^{N} \left( X_i^{(g)} \right)^2}{N} - \bar{X}_g^2$ and $\dfrac{\sum_{i=1}^{N} \left( X_i^{(h)} \right)^2}{N} - \bar{X}_h^2$. Then
near-equality of sums of squares of sub-test scores (Theorem 4.2) and near-equality of means (Theorem 4.1)
imply that the difference between sub-test variances is small. Empirical confirmation of near-equality of
sub-test variances is presented in later sections. Now, item scores manifest item difficulty. Therefore, splitting
using item scores is equivalent to splitting using item difficulty values. From the near-equality of variances
it follows that if, instead of allocating items into the two sub-tests on the basis of the difficulty value of the
item, we split the test according to item variance, we would get nearly the same sub-tests. Thus our method
of dichotomisation with respect to item scores (or difficulty values) is nearly equivalent to dichotomisation
with respect to item variance.

The near-equality of the means and variances of the sub-tests is indicative of the sub-tests being nearly
parallel, since parallel tests have equal means and variances. Then the item scores of the two sub-tests can be
taken as coming from nearly the same populations, with nearly the same density functions having two parameters,
namely mean and variance.

For parallel sub-tests, variances are equal, and the regression of the item score vector of the g-th sub-test
on the h-th coincides with the regression of the scores of the h-th sub-test on that of the g-th; such coinciding
regression lines imply that the Pearson product moment correlation between $X^{(g)}$ and $X^{(h)}$,
or the split-half regression coefficient $r_{gh}$, is maximal, i.e. unity (Stepniak & Wasik 2009). Our method of
splitting the test results in nearly parallel sub-tests, so that the split-half regression is close to, but less than,
unity. The closer $|S|$ is to 0, the more parallel are the resulting sub-tests g and h, and the higher is the value of
the attained $r_{gh}$. Thus, the method gives a simple way of splitting a test in halves, while ensuring that the
split-half reliability is maximum.

The iterative method also ensures that the two sub-tests are almost equi-correlated with a third variable. 
Thus, these (near-parallel) sub-tests will have almost equal validity. 

The problem of splitting an even number of non-negative item scores into two sub-tests with the same
number of items, such that the absolute difference between the sums of sub-test item scores is minimised, is a simpler
example of the partition problem that has been addressed in the literature (Hayes 2002; Borgs, Chayes &
Pittel 2001, among others). Detailed discussion of these methods is outside the scope of this paper, but within
the context of test dichotomisation aimed at computing reliability, we can see that assigning
even-numbered items to one sub-test and odd-numbered ones to the other, as a method of splitting a test into two
halves, cannot yield sub-tests that are as parallel as the sub-tests achieved via our method of splitting, because
an odd-even assignment does not guarantee that the sum of item scores in one sub-test approaches that of the
other closely, unless the difficulty values of all items are nearly equal. Thus, a test in which consecutive items
are of widely different difficulty values will yield sub-tests with means that are far from each other, if the test
is dichotomised using an odd-even method of assignment of items to the sub-tests; in comparison, our method
of splitting is designed to yield sub-tests with close values of variances and even closer means. Following on
from the low order of our algorithm (see Remark 4.1), a salient feature of our method is that it is a fast method,
the order of which depends on the number of items in the test and not on the number of examinees, thus allowing
for fast splitting of the test score data and, consequently, fast computation of reliability.
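The contrast with odd-even assignment can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: `paired_split` is a hypothetical greedy stand-in (rank the items by item score, then share each adjacent pair between the two halves so that the running difference of sub-test sums stays near zero) and is not the iterative algorithm of Section 4.2; the item scores are made up.

```python
def odd_even_split(item_scores):
    """Assign items in positions 1, 3, 5, ... to g and 2, 4, 6, ... to h."""
    return item_scores[0::2], item_scores[1::2]


def paired_split(item_scores):
    """Hypothetical stand-in for a difficulty-based split (not Section 4.2).

    Items are ranked by item score; each adjacent pair is shared between
    g and h so that the running difference sum(g) - sum(h) stays small.
    Assumes an even number of items.
    """
    ranked = sorted(item_scores)
    g, h, diff = [], [], 0
    for k in range(0, len(ranked), 2):
        low, high = ranked[k], ranked[k + 1]
        if diff > 0:            # g is ahead: give it the lower item score
            g.append(low)
            h.append(high)
            diff -= high - low
        else:                   # g is level or behind: give it the higher one
            g.append(high)
            h.append(low)
            diff += high - low
    return g, h


# Made-up item scores in which consecutive items differ widely in difficulty.
scores = [10, 90, 20, 80, 30, 70, 40, 60]

g_oe, h_oe = odd_even_split(scores)
g_ps, h_ps = paired_split(scores)
print("odd-even difference of sums:", abs(sum(g_oe) - sum(h_oe)))
print("paired difference of sums:  ", abs(sum(g_ps) - sum(h_ps)))
```

For these eight items the odd-even halves sum to 100 and 300, whereas the paired halves both sum to 200. Both routines only sort or slice the item scores, so their cost does not depend on the number of examinees.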

4.4. Estimation of True Score 

When the reliability is computed according to the method laid out above, it is possible to perform interval
estimation of the true score of an individual whose obtained score is known, where the interval itself represents
1 standard deviation of the distribution of the error score $E$. In other words, the distribution of the error
score $E$ is assumed normal with zero mean and variance $S_E^2 = S_X^2(1 - r_{tt})$ (as per the classical definition
of reliability); using the error variance, errors on the estimated true scores are given as $\pm S_E$. Below we discuss
this choice of model for the interval on the estimate of the true scores.

At a given observed score $X$, the true score can be estimated using the linear regression:

$$\hat{T} = \alpha_1 + \beta_1 X \qquad (4.21)$$

where the regression coefficients are $\beta_1 = r_{XT}\,\dfrac{S_T}{S_X} = r_{tt}$ and $\alpha_1 = \bar{X}(1 - \beta_1)$.

That the error on the estimated true score is given by $\pm S_E$ is indeed a model choice (Kline 2005) and is,
in fact, an over-estimation, given that the squared error $e_i^2$ of estimation is

$$e_i^2 = (\alpha_1 + \beta_1 X_i - X_i)^2 = \left(\bar{X}(1 - r_{tt}) + r_{tt}X_i - X_i\right)^2 = \left[(X_i - \bar{X})(1 - r_{tt})\right]^2$$

(recalling that $\alpha_1 = \bar{X}(1 - r_{tt})$ and $\beta_1 = r_{tt}$). Then

$$\frac{1}{N}\sum_{i=1}^{N} e_i^2 = (1 - r_{tt})^2 S_X^2 < (1 - r_{tt})S_X^2 = S_E^2,$$

i.e. $S_E^2 - \frac{1}{N}\sum_{i=1}^{N} e_i^2 = (1 - r_{tt})S_X^2 - (1 - r_{tt})^2 S_X^2 = r_{tt}(1 - r_{tt})S_X^2$. In other words, the squared standard error of
prediction of the true score from a linear regression model falls short of the error variance by the amount
$r_{tt}(1 - r_{tt})S_X^2$. Thus, the higher the reliability $r_{tt}$ of the test, the smaller is the difference between the standard
error of prediction and our model of the error (on $T$). In general, our model choice of the error on $T$ is higher
than the standard error of prediction, for a given $X$.
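The point estimate of Equation 4.21 and the $\pm S_E$ interval can be sketched in a few lines of Python. The function names and the input values below are illustrative assumptions, not part of the paper; the formulas are the ones stated above, with $\beta_1 = r_{tt}$, $\alpha_1 = \bar{X}(1 - r_{tt})$ and $S_E^2 = S_X^2(1 - r_{tt})$.

```python
import math


def estimate_true_score(x, mean_x, var_x, r_tt):
    """Return the regression estimate of the true score and the error S_E."""
    beta = r_tt                          # slope of Equation 4.21
    alpha = mean_x * (1.0 - beta)        # intercept of Equation 4.21
    t_hat = alpha + beta * x
    s_e = math.sqrt(var_x * (1.0 - r_tt))
    return t_hat, s_e


def squared_error_gap(var_x, r_tt):
    """S_E^2 minus the mean squared error of prediction: r_tt (1 - r_tt) S_X^2."""
    return r_tt * (1.0 - r_tt) * var_x


# Hypothetical test: mean 10, variance 4, reliability 0.75, observed score 14.
t_hat, s_e = estimate_true_score(x=14.0, mean_x=10.0, var_x=4.0, r_tt=0.75)
gap = squared_error_gap(var_x=4.0, r_tt=0.75)
print("interval:", (t_hat - s_e, t_hat + s_e), "gap:", gap)
```

With these made-up inputs the estimate is 13 with $S_E = 1$, and the gap between $S_E^2$ and the mean squared error of prediction is 0.75; the gap vanishes as $r_{tt}$ approaches 1.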

Using $\hat{T}$ of each individual taking the test, one may undertake computation of the probability that the
percentile true score of the i-th examinee is $t$, given that the observed percentile score of the examinee is $x$ and
the reliability is $r$, i.e. $\Pr(T < t \mid r_{tt} = r, X < x)$. We illustrate this in our analysis of real and simulated test
score data.

In the context of estimating true scores using a computed reliability, we realise that the estimate
$\hat{T}_{\text{split-half}}$ obtained by using the split-half reliability $r_{gh}$ will be in excess of the estimate $\hat{T}_{\text{classical}}$ obtained by using
the classically defined reliability $r_{tt}$ for high values of the observed scores, and will be under-estimated compared
to $\hat{T}_{\text{classical}}$ for low values of $X$. This can be easily realised.

Theorem 4.3. $\hat{T}_{\text{split-half}} \geq \hat{T}_{\text{classical}}$ for $X \geq \bar{X}$ and $\hat{T}_{\text{split-half}} \leq \hat{T}_{\text{classical}}$ for $X \leq \bar{X}$.

Proof.

$$\hat{T}_{\text{split-half}} - \hat{T}_{\text{classical}} = [\bar{X} + r_{gh}(X - \bar{X})] - [\bar{X} + r_{tt}(X - \bar{X})] = (X - \bar{X})(r_{gh} - r_{tt}) \qquad (4.22)$$

which is $> 0$ for $X > \bar{X}$ and $< 0$ for $X < \bar{X}$, given that $r_{gh} > r_{tt}$.

Therefore $\hat{T}_{\text{split-half}} \geq \hat{T}_{\text{classical}}$ for $X \geq \bar{X}$ and $\hat{T}_{\text{split-half}} \leq \hat{T}_{\text{classical}}$ for $X \leq \bar{X}$.

□

5. Reliability of a battery

The above method of finding reliability, as per the classical definition, can be extended to compute the
reliability of a battery of tests. Battery scores of a set of examinees are obtained as a function of the scores of the
constituent tests. After administration of the battery to $N$ individuals, $S_X^2$, $S_T^2$, $S_E^2$ and $r_{tt}$ of each constituent
test are known as per the method described above. The method of obtaining the reliability of the battery will
depend on these parameters as well as on the definition of the battery scores. Usually, battery scores are
computed as the sum of the scores in the individual constituent tests (summative scores) or as a weighted sum of
these test scores. However, the possibility of a non-linear combination of test scores to depict battery scores
cannot be ruled out. Here we discuss the weighted sum of component test scores after motivating the concept
of summative scores; we illustrate our method of finding the weights on real data (Section 7).

5.1. Weighted sum of component tests 

Suppose a battery consists of two tests: Test-1 and Test-2. Let the i-th examinee's scores in the 2 constituent
tests be $X_{1i}$ and $X_{2i}$. Let this examinee's true score and error score in the j-th constituent test be $T_{ji}$ and $E_{ji}$
respectively, $j = 1, 2$. Then the summative score of the i-th examinee is $X_i = X_{1i} + X_{2i}$, assuming additivity
and equal weights. Let $r_{tt(1)}$ and $r_{tt(2)}$ denote the reliability of the respective constituent tests.


Now,

$$\mathrm{var}(X) = \mathrm{var}(X_1) + \mathrm{var}(X_2) + 2\,\mathrm{cov}(X_1, X_2) \qquad (5.1)$$


suggests 




(5.2) 


since $\sum_i T_{1i}E_{2i} = \sum_i T_{2i}E_{1i} = \sum_i E_{1i}E_{2i} = 0$.

Again

$$\mathrm{cov}(X_1, X_2) = \mathrm{cov}(T_1, T_2) \qquad (5.3)$$

Now the variance of the true score of the battery is

$$S^2_{T(\text{battery})} = \mathrm{var}(T_1) + \mathrm{var}(T_2) + 2\,\mathrm{cov}(T_1, T_2) = \mathrm{var}(T_1) + \mathrm{var}(T_2) + 2\,\mathrm{cov}(X_1, X_2) = r_{tt(1)}S_{X_1}^2 + r_{tt(2)}S_{X_2}^2 + 2\,\mathrm{cov}(X_1, X_2) \qquad (5.4)$$

where we have used Equation 5.3. Thus, the reliability of a battery can be found using Equation 5.4. Using these
equations, the reliability of the battery $r_{tt(\text{battery})}$ is given by

$$r_{tt(\text{battery})} = \frac{r_{tt(1)}S_{X_1}^2 + r_{tt(2)}S_{X_2}^2 + 2\,\mathrm{cov}(X_1, X_2)}{S_{X_1}^2 + S_{X_2}^2 + 2\,\mathrm{cov}(X_1, X_2)} \qquad (5.5)$$

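In general, if a battery has $K$ tests and all tests are equally important, then the reliability of the battery, where
the battery score is a summative score of the $K$ constituent tests, will be given by

$$r_{tt(\text{battery})} = \frac{\sum_{i=1}^{K} r_{tt(i)}S_{X_i}^2 + 2\sum_{i<j}\mathrm{cov}(X_i, X_j)}{\sum_{i=1}^{K} S_{X_i}^2 + 2\sum_{i<j}\mathrm{cov}(X_i, X_j)} \qquad (5.6)$$

Now suppose the constituent tests carry weights $W_1, \ldots, W_K$ with $\sum_{i=1}^{K} W_i = 1$. To find the reliability of such a battery, let us consider the composite score or the battery score $Y = \sum_{i=1}^{K} W_i X_i$. Then

$$r_{tt(\text{battery})} = \frac{\sum_{i=1}^{K} r_{tt(i)}W_i^2 S_{X_i}^2 + 2\sum_{i<j}W_i W_j\,\mathrm{cov}(X_i, X_j)}{\sum_{i=1}^{K} W_i^2 S_{X_i}^2 + 2\sum_{i<j}W_i W_j\,\mathrm{cov}(X_i, X_j)} \qquad (5.7)$$

However, a major problem in the computation of the reliability coefficient in this fashion could be
experienced in the determination of the weights.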
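The weighted battery reliability can be evaluated with a short Python sketch. The function name `battery_reliability` is a hypothetical helper; it assumes the form of the weighted ratio used in the worked substitution of Section 7.2, where diagonal terms of the variance-covariance matrix are scaled by the test reliabilities and off-diagonal terms are left unchanged (since $\mathrm{cov}(T_i, T_j) = \mathrm{cov}(X_i, X_j)$). The numbers are the ones substituted in Section 7.2.

```python
def battery_reliability(weights, reliabilities, cov):
    """Ratio of weighted true-score variance to weighted observed variance.

    cov is the K x K variance-covariance matrix of the observed test scores.
    """
    k = len(weights)
    true_var = 0.0
    obs_var = 0.0
    for i in range(k):
        for j in range(k):
            term = weights[i] * weights[j] * cov[i][j]
            obs_var += term
            # Diagonal: true-score variance is r_tt(i) * S_Xi^2.
            # Off-diagonal: cov(T_i, T_j) equals cov(X_i, X_j).
            true_var += reliabilities[i] * term if i == j else term
    return true_var / obs_var


# Values as substituted in Section 7.2.
cov = [[6.2405, 2.4571],
       [2.4571, 4.0571]]
r_battery = battery_reliability([0.7028, 0.2972], [0.6571, 0.3488], cov)
print(round(r_battery, 4))
```

Setting every weight to 1 reduces the same function to the summative case of Equation 5.5.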

5.2. Our method for determination of weights

We propose a method towards the determination of the vector of weights $W = (W_1, W_2, \ldots, W_K)^{T}$ with
$\sum_{i=1}^{K} W_i = 1$, such that the variance of the battery score is a minimum, where the battery score vector is $Y$
with $\mathrm{var}(Y) = W^{T}DW$ and $D$ is the variance-covariance matrix of the component tests. From a single
administration of a battery to $N$ examinees, the variance-covariance matrix $D$ is such that the i-th diagonal
element of this matrix is the variance $S_{X_i}^2$ of the i-th constituent test, and the ij-th off-diagonal element
is $\mathrm{cov}(X_i, X_j)$. The minimisation is subject to the condition $W^{T}e = 1$, where $e$ is the $K$-dimensional vector, the
i-th component of which is 1, $\forall\, i = 1, \ldots, K$. It is possible to determine the weights using the method of
Lagrangian multipliers $\lambda$. To this effect, we define $A := W^{T}DW + \lambda(1 - W^{T}e)$ and set the derivatives of
$A$ taken with respect to $W$ and $\lambda$ to zero each, to respectively attain:

$$2DW - \lambda e = 0 \quad\text{and}\quad 1 - W^{T}e = 0.$$

Thus,

$$W = \frac{\lambda}{2}D^{-1}e \quad\text{and}\quad W^{T}e = 1, \quad\text{or}\quad W = \frac{D^{-1}e}{e^{T}D^{-1}e} \quad\text{and}\quad \lambda = \frac{2}{e^{T}D^{-1}e}. \qquad (5.8)$$

5.3. Benefits of this method of finding weights


This method of finding the reliability of a battery of a number of tests can be used in the context of a summative
score, a difference score or a weighted sum of scores. Weights found as above have the advantage that the
battery score or composite score ($Y$) has minimum variance. Also, the covariance between the battery score and
the test score of an individual test is a constant, i.e. $\mathrm{cov}(Y, X_i) = \dfrac{1}{e^{T}D^{-1}e}\ \forall\, i$. If the available test scores
are standardised and independent such that the i-th score is $Z_i$, then the weights are equal, and the correlation
between $Y$ and $Z_i$ is the same as the correlation between $Y$ and $Z_j$, namely $\dfrac{1}{\sqrt{e^{T}R^{-1}e}}$, $\forall\, i \neq j$, $i, j = 1, \ldots, K$,
where $R$ is the correlation matrix. In other words, the battery score is equi-correlated with the standardised
score of each constituent test. The method alone does not guarantee that all weights be non-negative. This
is especially true if the tests are highly independent so that $R$ is too sparse. Non-negative weights can be
ensured by adding the physically motivated constraint that $W_i \geq 0\ \forall\, i$, in which case the problem boils down
to a Quadratic Programming formulation. This is similar to the data-driven weight determination method
presented by Chakrabartty (2013).
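The closed form of Equation 5.8, $W = D^{-1}e / (e^{T}D^{-1}e)$, and the constant-covariance property can be checked numerically with the following Python sketch. The helper names (`solve_linear`, `min_variance_weights`, `cov_with_battery`) are hypothetical, the linear solve is a plain Gauss-Jordan elimination written out to keep the example free of external libraries, and the non-negativity constraint is not handled. The matrix is the one quoted in Section 7.2, at the four-decimal precision given there.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def min_variance_weights(d):
    """Weights W = D^{-1} e / (e^T D^{-1} e), in the row order of D."""
    u = solve_linear(d, [1.0] * len(d))   # u = D^{-1} e
    total = sum(u)                        # e^T D^{-1} e
    return [ui / total for ui in u]


def cov_with_battery(d, w):
    """cov(Y, X_i) for each test i, i.e. the vector D W."""
    return [sum(dij * wj for dij, wj in zip(row, w)) for row in d]


D = [[6.2405, 2.4571],
     [2.4571, 4.0571]]
W = min_variance_weights(D)
C = cov_with_battery(D, W)
print([round(w, 4) for w in W], [round(c, 4) for c in C])
```

For this matrix the two weights come out at about 0.2972 and 0.7028, listed in the row order of $D$, and both entries of $DW$ coincide, as the constant-covariance property requires.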

If it is found desirable to get weights so that $W_i$ is proportional to $\mathrm{cov}(Y, X_i)$, then it can be proved that
$W$ is the eigenvector corresponding to the maximum eigenvalue of the variance-covariance matrix of the test
scores. If, on the other hand, we want $W_i$ as proportional to $\dfrac{\mathrm{cov}(Y, X_i)}{\mathrm{var}(X_i)}$, then $W = \dfrac{S^{-1}U}{e^{T}S^{-1}U}$, where $S$ is the
diagonal matrix of the standard deviations of the constituent tests and $U$ is the eigenvector corresponding to the
maximum eigenvalue of the correlation matrix.

In order to find the battery scores, weights of test scores can be found by principal component analysis,
factor analysis, canonical reliability, etc. However, it is suggested that the reliability of the composite scores be
found as weighted scores (Equation 5.7) where the weights are determined as per Equation 5.8.

So far, we have discussed additive models. We could also have multiplicative models like $Y = \prod_{i=1}^{K} X_i^{W_i}$,
so that on a log-scale, we retrieve the additive model.

We illustrate our method of finding weights using real data in Section 7.

6. Application to simulated data

In order to validate our method of computing the classically defined reliability following dichotomisation of
a test into parallel groups, we use our method to find the value of $r_{tt}$ of 4 toy tests, the scores of which are
simulated from chosen models, as described below. We simulate the 4 test data sets $D_1$, $D_2$, $D_3$, $D_4$ under 4
distinct model choices; the underlying standard model is that the score $X_i^{(j)}$ obtained by the i-th examinee
in the j-th item of a test is a Bernoulli variate, i.e.

$$X_i^{(j)} \sim \mathrm{Bernoulli}(p) \quad\text{implying}\quad \Pr(X_i^{(j)} = 1) = p,\quad \Pr(X_i^{(j)} = 0) = 1 - p$$

where the probability $\Pr(X_i^{(j)} = 1)$ of answering the j-th item correctly by the i-th examinee

• is held a constant $p_i$ for the i-th examinee $\forall\, j = 1, \ldots, n$, with $p_i$ sampled from a uniform distribution
in [0,1], $\forall\, i = 1, \ldots, N$, in data $D_1$.

• is held a constant $p_i$ for the i-th examinee $\forall\, j = 1, \ldots, n$, with $p_i$ sampled from a Normal distribution
$\mathcal{N}(0.5, 0.2)$, $\forall\, i = 1, \ldots, N$, in data $D_3$.

• is held a constant $p_j$ for the j-th item for all examinees, with $p_j$ sampled from a uniform distribution in
[0,1], $\forall\, j = 1, \ldots, n$, in data $D_2$.

• is held a constant $p_j$ for the j-th item for all examinees, with $p_j$ sampled from a Normal distribution
$\mathcal{N}(0.5, 0.2)$, $\forall\, j = 1, \ldots, n$, in data $D_4$.

We use $n = 50$ and $N = 999$ in our simulations.

Thus we realise that data sets $D_1$ and $D_3$ resemble test data in reality, with the ability of the i-th examinee
represented by $p_i$. Our simulation models are restrictive though, in the sense that variation with items is
ignored. Such a test, if administered, will not be expected to have a low reliability. On the other hand, the
data sets $D_2$ and $D_4$ are toy data sets that are utterly unlike real test data, in which the probability of correct
response to a given item is a constant, irrespective of examinee ability, and examinee ability varies with item,
equally for all examinees. Toy test data generated under such an unrealistic model would manifest low test
variance and therefore low reliability. Given this background, we proceed to analyse these simulated tests
with our method of computing $r_{tt}$.
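The two uniform-probability scenarios ($D_1$-like and $D_2$-like data) can be mimicked with the following Python sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, `split_items` is a simple greedy stand-in rather than the algorithm of Section 4.2, and the reliability is computed as $r_{tt} = 1 - \lVert X_g - X_h \rVert^2 / (N S_X^2)$, which combines $S_E^2 = S_X^2(1 - r_{tt})$ from Section 4.4 with the relation $\lVert X_g - X_h \rVert^2 = N S_E^2$ stated in Section 8 (Equation 4.6 itself is not reproduced here).

```python
import random


def simulate_scores(n_examinees, n_items, by_examinee, rng):
    """0/1 item scores; success probability fixed per examinee or per item."""
    n_probs = n_examinees if by_examinee else n_items
    probs = [rng.random() for _ in range(n_probs)]     # uniform on [0, 1)
    return [[1 if rng.random() < (probs[i] if by_examinee else probs[j]) else 0
             for j in range(n_items)] for i in range(n_examinees)]


def split_items(scores):
    """Stand-in split: rank items by item score, share adjacent pairs."""
    n_items = len(scores[0])
    totals = [sum(row[j] for row in scores) for j in range(n_items)]
    order = sorted(range(n_items), key=lambda j: totals[j])
    g, h, diff = [], [], 0
    for k in range(0, n_items - 1, 2):
        low, high = order[k], order[k + 1]
        gap = totals[high] - totals[low]
        if diff > 0:                 # g is ahead: give it the lower item
            g.append(low)
            h.append(high)
            diff -= gap
        else:
            g.append(high)
            h.append(low)
            diff += gap
    return g, h


def classical_reliability(scores, g, h):
    """r_tt = 1 - S_E^2 / S_X^2 with S_E^2 = ||X_g - X_h||^2 / N."""
    n = len(scores)
    xg = [sum(row[j] for j in g) for row in scores]
    xh = [sum(row[j] for j in h) for row in scores]
    x = [a + b for a, b in zip(xg, xh)]
    mean = sum(x) / n
    var_x = sum((v - mean) ** 2 for v in x) / n
    err_var = sum((a - b) ** 2 for a, b in zip(xg, xh)) / n
    return 1.0 - err_var / var_x


rng = random.Random(0)                       # fixed seed for repeatability
d1 = simulate_scores(999, 50, True, rng)     # probability varies by examinee
d2 = simulate_scores(999, 50, False, rng)    # probability varies by item
r1 = classical_reliability(d1, *split_items(d1))
r2 = classical_reliability(d2, *split_items(d2))
print(round(r1, 3), round(r2, 3))
```

Under these assumptions the examinee-driven data should give a reliability close to 1 and the item-driven data a reliability close to 0, in line with the qualitative expectation described above.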

We implement our method to compute reliabilities of all 4 data sets (using Equation 4.6). The results are
given in Table 1.

Table 1

Table showing results of using our method of dichotomisation of 4 simulated test data sets, $D_1, \ldots, D_4$, into 2 parallel sub-tests, g
and h, resulting in the computation of classically-defined reliability $r_{tt}$ of the test (see Section 4.2 for details of our method).

Data set   Test       Sum of item scores     Sum of squares of item     $\sum X_g X_h$   Reliability
           variance   in sub-tests           scores in sub-tests                          $r_{tt}$
                      g          h           g           h
$D_1$      222.89     12014      12013       202506      201523         198257           0.9662
$D_2$      7.96       10440      10439       113212      112843         109133           0.02050
$D_3$      110.06     12597      12597       188727      189465         183564           0.8994
$D_4$      10.86      12683      12684       166199      166520         161130           0.0361

Fig 6.1. Figure showing histograms of the scores of 999 examinees in the 2 sub-tests (in solid and broken lines
respectively), obtained by splitting the simulated test data set $D_3$, which has been generated under the choice that examinee
ability is normally distributed (right panel). The left panel includes histograms of the 2 sub-tests that result from
splitting the test data $D_4$ that was simulated using examinee ability as item-dependent, with probability for correct response
to the j-th item given as a normal variate, unrealistically fixed for all examinees, $\forall\, j = 1, \ldots, 50$.


The reliabilities of tests with data $D_1$ and $D_3$ are, as expected, high, while those for the infeasible tests
$D_2$ and $D_4$ are low, as per prediction. We take these results as a form of validation of our method of computing
$r_{tt}$ based on the splitting of a test.

In Figure 6.1, we present the histograms of the 2 sub-tests that result from splitting the realistic data $D_3$, as
well as histograms of the 2 sub-tests that result from the splitting of the unrealistic test data $D_4$.

6.1. Results from Large Simulated Tests

We also conducted a number of experiments with finding reliabilities of larger test data sets that were
simulated. The simulations were performed such that the test score variable has a Bernoulli distribution with
parameter $p$. In these simulations, we chose $p_i$ as fixed for the i-th examinee, with $p_i$ randomly sampled from
a chosen Gaussian pdf, i.e. $p_i \sim \mathcal{N}(0.5, 0.2)$ by choice, $i = 1, \ldots, N$. We simulated different test score data
sets in this way, including

- a test data set for $5\times10^{5}$ examinees taking a 50-item test.

- a test data set for $5\times10^{4}$ examinees taking a 50-item test.

- a test data set for 1000 examinees taking a 100-item test.

- a test data set for 1000 examinees taking a 1000-item test.

In each case, the test data was split using our method and the reliability of the test was computed as per the
classical definition. The 4 simulated test data sets mentioned above yielded reliabilities of 0.96, 0.98, 0.93,
0.85, in order of the above enumeration. Histograms of the sub-tests obtained by splitting each test data set were
over-plotted to confirm their concurrence.

Importantly, the run-time of reliability computation for these large cohorts of examinees ($N$ = 500,000 and
50,000), who take the 50-item test, is very short: from about 0.8 seconds for the 50,000 cohort to about
6.2 seconds for the 500,000 cohort. On the other hand, the order of our splitting algorithm being $O((n/2)^{2})$,
the run-times increase rapidly from the 100-item test to the 1000-item test, with the examinee number fixed.
These experiments indicate that the computation of reliabilities for very large cohorts of examinees, in a test
with a realistic number of items, is rendered very fast indeed using our method.

7. Application to real data 

We apply our method of computing reliability using the classical definition, following the splitting of a real
test into 2 parallel halves by the method discussed in Section 4.2. We also estimate the true scores and offer
the error in this estimation, using the computed reliability and a real test score data set. The real test
data used was obtained by examining 912 examinees in a multiple choice examination administered with the aim
of achieving selection to a position. The test had 50 items and the maximum time given was 90 minutes. To
distinguish between this real test data and other real test data sets that we will employ to determine the reliability
of a battery, we refer to the data set used in this section as DATA-I. For the real test data DATA-I, the
histograms of the item scores and of the examinee scores are shown in Figure 7.1. The histogram of the
examinee scores shows that $X_i < 26\ \forall\, i = 1, \ldots, 912$, with the biggest mode of the score distribution around
11. In other words, the test was a low-scoring one. DATA-I indicates 2 other smaller modes at about 15 and
22.

Fig 7.1. Figure showing histograms of the 912 examinee scores (left panel) and of 50 item scores (right panel) of the
real test data DATA-I that we use to implement our method of splitting the test, aimed at computation of reliability as
per the classical definition. The left panel also includes histograms of the g and h sub-tests that result from the splitting
of this test; the histograms of the sub-tests are shown in dotted and dashed-dotted lines, superimposed on the histogram
of examinee scores for the full test (in solid black line).

Some other parameters of the test data DATA-I are as listed below.

- number of items $n = 50$,

- number of examinees is $N = 912$,

- magnitude of the vector of maximum possible scores is $\lVert I \rVert = \sqrt{912 \times 50^{2}} \approx 1509.97$,

- magnitude of the observed score vector is $\lVert X \rVert \approx 357.82$,

- $\lVert X \rVert^{2} = 128032$,

- $\cos\theta_X \approx 0.9275$,

- test mean $\bar{X} \approx 10.99$,

- test variance $S_X^2 \approx 19.63$.
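The listed quantities are mutually consistent, which the following Python check illustrates. Two inputs are assumptions drawn from elsewhere in the text: the total of the examinee scores is taken as 5011 + 5011 (the sub-test sums reported for the split of DATA-I), and $\cos\theta_X$ is taken to be the cosine of the angle between the observed score vector $X$ and the vector $I$ of maximum possible scores, a definition that is inferred here rather than quoted.

```python
import math

N, n = 912, 50
sum_scores = 5011 + 5011        # sums of the g and h sub-test scores
sum_squares = 128032            # ||X||^2

norm_max = math.sqrt(N * n ** 2)                 # ||I||
norm_x = math.sqrt(sum_squares)                  # ||X||
mean_x = sum_scores / N                          # test mean
var_x = sum_squares / N - mean_x ** 2            # test variance (divisor N)
cos_theta = (n * sum_scores) / (norm_x * norm_max)

print(round(norm_max, 2), round(norm_x, 2), round(mean_x, 2),
      round(var_x, 2), round(cos_theta, 4))
```

The variance here uses the divisor $N$ rather than $N - 1$, which is what reproduces the listed value.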


The test was dichotomised into parallel halves by the iterative process discussed in Section 4.2. Let the
resulting sub-tests be referred to as the g and h sub-tests. Then each of these sub-tests had 25 items in it
and the sum of examinee scores is $\sum_{i=1}^{912} X_i^{(g)} = \sum_{i=1}^{912} X_i^{(h)} = 5011$, so that the mean of each of the two
sub-tests is equal to about 5.49. Also, the variances of the 2 sub-tests are approximately equal at 6.81 and 6.49. The
histograms of the examinee scores in these 2 sub-tests are drawn in blue and red in the left panel of Figure 7.1.
That the histograms for the g and h sub-tests overlap very well is indicative of these sub-tests being strongly
parallel. We regress the score vector $(X_1^{(g)}, \ldots, X_{912}^{(g)})^{T}$ on the score vector $(X_1^{(h)}, \ldots, X_{912}^{(h)})^{T}$; the regression
line is shown in the right panel of Figure 7.2. Similarly, regressing $(X_1^{(h)}, \ldots, X_{912}^{(h)})^{T}$ on the score vector
$(X_1^{(g)}, \ldots, X_{912}^{(g)})^{T}$ results in a similar linear regression line (shown in the left panel of Figure 7.2). Table 2
gives the details of the splitting of the 50 items of the full test into the 2 sub-tests.

Fig 7.2. Figure showing regression of the examinee scores in the sub-test g on the scores in sub-test h (right panel) and
vice versa (left panel), where the 2 sub-tests are obtained by splitting real data DATA-I with our method of
dichotomisation (Section 4.2). The score data are shown in filled circles while the regression lines are also drawn. The regression
coefficients are marked on the top left hand corner of the respective panel.

7.1. Computation of reliability 

We implement the splitting of the real test score data into the scores in the g and h sub-tests to compute the
reliability as per the theoretical definition, i.e. as given in Equation 4.6. Then, using the observed sub-test
score vectors $X^{(g)}$ and $X^{(h)}$, we get the classically defined reliability to be $r_{tt} \approx 0.66$. Then $r_{XT} := \sqrt{r_{tt}} \approx
0.8128$. On the other hand, the Pearson product moment correlation between $X^{(g)}$ and $X^{(h)}$ is $r_{gh} \approx 0.9941$.
In other words, the split-half reliability is $r_{gh} \approx 0.9941$.

Table 2

Splitting of real data DATA-I of a selection test with 50 items, administered to 912 examinees. The dichotomisation has been carried
out according to the algorithm discussed in Section 4.2.

g-th sub-test          h-th sub-test          Difference between
Score    Item no.      Score    Item no.      scores of 2 tests
75       25            75       1             0
84       39            80       46            4
85       43            90       24            -5
100      20            96       9             4
102      44            103      34            -1
111      50            106      31            5
112      5             113      41            -1
124      32            115      45            9
127      36            128      8             -1
131      33            129      26            2
134      18            135      3             -1
151      29            144      4             7
172      7             178      48            -6
193      19            190      21            3
195      37            196      14            -1
205      28            199      47            6
221      30            230      6             -9
239      27            236      49            3
256      11            263      12            -7
284      2             285      17            -1
292      22            294      10            -2
337      15            309      23            28
393      13            376      42            17
405      16            453      38            -48
483      35            488      40            -5

                                  g-th sub-test    h-th sub-test
Sum of item scores                5011             5011
Sum of squares of item scores     33453            33747

We can now proceed to estimate the true scores following the discussion presented in Section 4.4. For
the observed score $X_i$, the estimated true score is $\hat{T}_i = \beta_1 X_i + \alpha_1$ (as per Equation 4.21 and discussions
thereafter), where $\beta_1 = r_{tt}$ and $\alpha_1 = \bar{X}(1 - \beta_1)$. The true scores can also be estimated using the
split-half reliability instead of the reliability from the classical definition. Then the above regression coefficients
$\beta_q$ and $\alpha_q$ are computed as above, except this time $r_{tt}$ is replaced by $r_{gh}$, $q = 1, 2$. We present the true
scores estimated using both the reliability from the classical definition ($\hat{T}_{\text{classical}}$ in black) as well as the
split-half reliability ($\hat{T}_{\text{split-half}}$ in red) in Figure 7.3. In this figure, the errors in the estimated true scores
are superimposed as error bars on this estimate, in respective colours. This error is considered to be $\pm S_E$,
where the error variance is $S_E^2 = (1 - r_{tt})S_X^2$ when the classically defined reliability is implemented and
$S_E^2 = (1 - r_{gh})S_X^2$ when the split-half reliability is used.

In fact, the method of using $r_{gh}$ in the estimation of the true scores will result in the true scores $\hat{T}_{\text{split-half}}$
being over-estimated for $X > \bar{X}$ and under-estimated for $X < \bar{X}$, as shown in Theorem 4.3. This can be
easily corroborated in our results in Figure 7.3; the higher value of $r_{gh}$ (than of $r_{tt}$) results in $\hat{T}_{\text{split-half}} \geq
\hat{T}_{\text{classical}}$ for $X \geq \bar{X}$ and in $\hat{T}_{\text{split-half}} \leq \hat{T}_{\text{classical}}$ for $X \leq \bar{X}$.
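As a worked illustration of the size of this effect, the difference $(X - \bar{X})(r_{gh} - r_{tt})$ can be evaluated with the rounded DATA-I values quoted in this section ($\bar{X} \approx 10.99$, $r_{tt} \approx 0.66$, $r_{gh} \approx 0.9941$). The two observed scores, 20 and 5, are chosen here purely for illustration.

```latex
\begin{align*}
X = 20:\quad \hat{T}_{\text{split-half}} - \hat{T}_{\text{classical}}
  &= (20 - 10.99)(0.9941 - 0.66) = 9.01 \times 0.3341 \approx 3.01,\\
X = 5:\quad \hat{T}_{\text{split-half}} - \hat{T}_{\text{classical}}
  &= (5 - 10.99)(0.9941 - 0.66) = -5.99 \times 0.3341 \approx -2.00.
\end{align*}
```

With these rounded inputs, the split-half estimate lies about 3 score points above the classical one at $X = 20$ and about 2 points below it at $X = 5$.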

Figure 7.3 also includes the sample probability distribution of the point estimate of the true score obtained
from the linear regression model suggested in Equation 4.21, given the observed scores and using $r_{tt}$ and $r_{gh}$,
i.e. $\Pr(\hat{T} < t \mid r, X)$, where $r$ is the reliability. We invoke the lookup table that gives the examinee indices
corresponding to a point estimate of the true score. Then the percentile rank of all those examinees whose
true score is (point) estimated as $\hat{T}$ is given as $100 \times \Pr(\hat{T} < t \mid r, X)$. (This lookup table is not included in
the text for brevity's sake.) For example, the true score $\hat{T}_{\text{split-half}} \in [20, 21)$ is estimated using $r = r_{gh}$ for
examinees with indices 893, 867, 210, 837, 834, 408, 706, 690, 655, 653, 161, 638, 312, 308, 149, 290, 539,
260. Then the percentile rank of all these examinees is about 93.
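The percentile rank described above is 100 times the sample probability that a point estimate falls below a given value. A minimal Python sketch follows; the function name and the five estimates are hypothetical and are not taken from DATA-I.

```python
def percentile_rank(estimates, t):
    """100 x sample Pr(T_hat < t) over a list of point estimates."""
    below = sum(1 for value in estimates if value < t)
    return 100.0 * below / len(estimates)


# Hypothetical point estimates of the true score for five examinees.
estimates = [10.0, 12.0, 12.0, 15.0, 20.0]
rank_15 = percentile_rank(estimates, 15.0)
print(rank_15)
```

Three of the five estimates lie strictly below 15, so the rank at 15 is 60 in this toy example.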

Thus, our method of splitting the test into 2 parallel halves helps to find a unique measure of reliability 
as per the classical definition, for a given real data set. Using this we can then estimate true scores for each 
observed score in the data. 

7.2. Weighted battery score using real data

In order to illustrate our method of finding weights relevant to the computation of the classically defined
weighted reliability of a test battery (Equation 5.7), we employ real life test data sets DATA-II(a) and
DATA-II(b). These data sets comprise the examinee scores of 784 examinees who took 2 tests (aimed at selection for
a position). The test that resulted in DATA-II(a) aimed at measuring examinee verbal ability, while the test for
which DATA-II(b) was the result measured ability to interpret data. The former test contained 18 items and
the latter 22 items. Histograms of examinee scores and item scores of these 2 tests are shown in Figure 7.4.
The mean and variance of the 2 tests corresponding to DATA-II(a) and DATA-II(b) are approximately 4.30,
6.25 and 4.20, 4.06 respectively. The battery comprising these two tests is considered for its reliability.

Fig 7.3. Left: figure showing the true scores estimated at observed examinee scores of the real test data DATA-I. The
true scores estimated using the reliability computed from the classical definition are shown in black filled circles; the
corresponding estimates of the error scores are superimposed as error bars. The true scores estimated using the
split-half reliability are plotted in red crosses, as are the corresponding error scores. Right: the sample cumulative probability
distributions of the point estimate of true scores obtained from the linear regression model given in Equation 4.21, where
implementation of the classically defined reliability results in the plot in black circles while that obtained using
split-half reliability is in red crosses. This cumulative distribution can be implemented to identify the percentile rank of an
examinee, in conjunction with the lookup table containing the ranking of examinee indices by the point estimate of the
true score.

Fig 7.4. Left: figure showing histograms of the examinee scores in the tests, the scores of which constitute data sets
DATA-II(a) (in black) and DATA-II(b) (in red dotted line). Right: figure showing histograms of item scores in real data
DATA-II(a) (in black solid lines) and in DATA-II(b) (in red broken lines).


We split each test into 2 parallel halves using our method of dichotomisation delineated in Section 4.2.
Splitting DATA-II(b) results in 2 halves, the means of the examinee scores of which are about 2.10 and
the variances of the examinee scores of which are about 1.80 and 1.65. Reliability defined classically (Equation 4.6)
computed using these split halves is about 0.35. The correlation between the split halves is about 0.90. On
the other hand, splitting DATA-II(a) results in 2 halves, the means of which are equal to about 2.15 and the
variances of which are about 1.87 and 2.19. The classically defined reliability of this test using these split halves
turns out to be 0.66 approximately. The correlation coefficient between the two halves is about 0.95. The true
scores estimated for both observed data sets are shown in Figure 7.5, with errors of estimation superimposed
as $\pm S_E$, where $S_E^2$ is the variance of the error scores that are modelled as distributed as $\mathcal{N}(0, S_E^2)$.

Fig 7.5. Figure showing true scores estimated using the observed scores in real data set DATA-II(a) (in black filled
circles) and in DATA-II(b) (in red crosses). Errors of estimate are depicted as error bars that correspond to
$\pm\sqrt{1 - r_{tt}}\,S_X$, where $r_{tt}$ is the classically defined reliability computed using the respective real data set.


Using the scores in the 2 tests in the considered battery, namely sets DATA-II(a) and DATA-II(b), we
compute the variance-covariance matrix $D$ of the observed scores (see Equation 5.8). Then in this example, $D$ is
a 2x2 matrix, the diagonal elements of which are the variances of the 2 tests and the off-diagonal elements
are the covariances of the examinee test score vectors $X_1$ and $X_2$ of these 2 tests. Then in this example
real-life test battery, recalling that $e = (1, 1)^{T}$, we get

$$D = \begin{pmatrix} 6.24 & 2.46 \\ 2.46 & 4.06 \end{pmatrix}, \quad\text{so that}\quad D^{-1}e \approx (0.08, 0.20)^{T},\quad e^{T}D^{-1}e \approx 0.28.$$

Then the weights are $\approx$ 0.7028 and 0.2972 (to 4 significant figures).

Then, recalling that (up to 4 decimal places) the reliability and test variance of the 2 tests, the score data of
which are DATA-II(a) and DATA-II(b), are 0.6571, 6.2405 and 0.3488, 4.0571 respectively, and that the covariance
between the 2 tests is 2.4571, we use Equation 5.7 to compute the reliability of this real battery to be
approximately

$$\frac{0.6571 \times 6.2405 \times 0.7028^{2} + 0.3488 \times 4.0571 \times 0.2972^{2} + 2 \times 0.7028 \times 0.2972 \times 2.4571}{0.7028^{2} \times 6.2405 + 0.2972^{2} \times 4.0571 + 2 \times 0.7028 \times 0.2972 \times 2.4571} \approx 0.71.$$

Thus, using our method of splitting a test in two parallel halves, we have been able to compute the reliability of
the test as per the classical definition and extended this to compute the reliability of a real battery comprising
two tests. For the real data sets DATA-II(a) and DATA-II(b) used, the battery reliability turns out to be about
0.71.

8. Summary & Discussions 

The paper presents a new, easily calculable split-half method of obtaining the reliability of tests, using the
classical definition, where the basic idea implemented is that the square of the magnitude of the difference between
the score vectors $X_g$ and $X_h$ of N examinees in the g and h sub-tests, obtained by splitting the full test, is
proportional to the variance $S_E^2$ of the errors in the scores obtained by the examinees who take the test, i.e.
$\|X_g - X_h\|^2 = N S_E^2$. Here, working within the paradigm of Classical Test Theory, the error in an
examinee's score is the difference between the observed and true scores of the examinee. Our method of splitting
the test is iterative in nature and has the desirable property that the sample distributions of the split halves are
nearly coincident, indicating approximately equal means and variances of the split halves. Importantly,
the splitting method that we use ensures maximum split-half correlation between the split halves, and the
splitting is performed on the basis of difficulty of the items (or questions) of the test, rather than examinee
attributes. A crucial feature of this method is that, the splitting being in terms of item difficulty, the
method requires very low computational resources to split a very large test data set into two nearly coincident
halves. In other words, our method can easily be implemented to find the classically defined reliability using
test data that is obtained by collating responses from a very large sample of examinees, to whom a test of as
large or as small a number of items has been administered. In our demonstration of this method on a moderately
large real test data set with 912 examinees and 50 items, convergence to optimum splitting (splitting into halves that
share equal means and nearly equal variances) was achieved in about 0.024 seconds. Implementation on sets
of toy data, generated under different model choices for examinee ability, was also undertaken: 999 examinees
responding to a 50-item test, as well as much larger cohorts of examinees (500,000 to 50,000) taking a 50-item
test, and a cohort of 1000 examinees taking 100 to 1000-item tests. The order of our splitting algorithm is
corroborated to be quadratic in half of the number of items in the test, while computational time for reliability
computation (input+output times, in addition to splitting of the test) varies linearly with examinee sample
size, so that even for the 500,000 examinees taking the 50-item test, reliability is computed in less than 10
seconds. Once the reliability of the test is computed, it is exploited to perform interval estimation of the true
score of each examinee, where the error of this estimation is modelled as the test error variance.
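The relation between the split-half difference and the error variance summarised above translates directly into a reliability computation once the two halves are available. The following minimal sketch assumes the half-test score vectors have already been obtained (it does not implement the paper's iterative, difficulty-based splitting), takes the full-test score as the sum of the two half scores, and uses the divide-by-N variance; the function name `classical_reliability` and the toy scores are illustrative only.

```python
# Sketch: reliability as 1 - S_E^2 / S_X^2, with S_E^2 = ||X_g - X_h||^2 / N.
# Assumes the two halves are given; the splitting algorithm itself is not shown.

def classical_reliability(half_g, half_h):
    """Return (reliability, error_variance) from two half-test score vectors."""
    n = len(half_g)
    if n == 0 or n != len(half_h):
        raise ValueError("halves must be non-empty and of equal length")
    totals = [g + h for g, h in zip(half_g, half_h)]   # full-test scores X
    mean_total = sum(totals) / n
    var_total = sum((x - mean_total) ** 2 for x in totals) / n  # S_X^2
    if var_total == 0:
        raise ValueError("observed score variance is zero")
    err_var = sum((g - h) ** 2 for g, h in zip(half_g, half_h)) / n  # S_E^2
    return 1.0 - err_var / var_total, err_var

# Toy half-test scores for four hypothetical examinees.
r_tt, s_e2 = classical_reliability([3, 4, 5, 6], [3, 5, 4, 6])
print(round(r_tt, 4), s_e2)
```

Each examinee contributes one subtraction and one squared term, which is in line with the linear dependence of the reliability computation on examinee sample size noted above.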

Subsequent to the dichotomisation of the test, we invoke a simple linear regression model for the true
score of an examinee, given the observed score X, to achieve an interval estimate of the true score, where
the interval is modelled as $\pm 1 S_E$. We recognise this interval to be in excess of ±1 standard deviation of the
error of estimation of the true score for a given X, as provided by the regression model; in other words, our
estimation of uncertainty on the estimated true score is pessimistic.
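To make the comparison of the two uncertainties concrete, here is a hedged sketch of the interval estimate. It assumes the standard Classical Test Theory regression of true score on observed score (estimated true score = mean + reliability × deviation from the mean) and takes the half-width of the interval to be the standard error of measurement, S_X (1 - r_tt)^{1/2}; both are assumptions made for illustration, as are the function name and the example mean of 10. The reliability and variance are the DATA-II(a) values quoted earlier.

```python
import math

# Sketch: interval estimate of a true score under an assumed CTT regression model.
# The regression form, the half-width choice and all names are assumptions.

def true_score_interval(x, mean_x, var_x, r_tt):
    """Return (estimate, (lower, upper), regression_se) for observed score x."""
    if not 0.0 <= r_tt <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    t_hat = mean_x + r_tt * (x - mean_x)                 # regressed true score
    s_e = math.sqrt(var_x * (1.0 - r_tt))                # standard error of measurement
    s_est = math.sqrt(var_x * r_tt * (1.0 - r_tt))       # regression standard error of estimate
    return t_hat, (t_hat - s_e, t_hat + s_e), s_est

# Hypothetical examinee with observed score 12 on a test with assumed mean 10.
t_hat, (lower, upper), s_est = true_score_interval(12.0, 10.0, 6.2405, 0.6571)
half_width = (upper - lower) / 2.0
print(round(t_hat, 4), round(half_width, 4), round(s_est, 4))
```

Since the reliability cannot exceed 1, the regression standard error of estimate never exceeds the half-width used here, which is the sense in which such an interval is pessimistic.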

This method of splitting a test into 2 parallel halves forms the basis of our computation of the reliability of
a battery, i.e. a set of tests, as per the classical definition. A weighted battery score is used in this computation,
where we implement a new way of determining the weights, by invoking a Lagrange multiplier based
solution. We illustrate the implementation of this method of determining weights, and thereby of computing
the reliability of a test battery following the computation of the reliabilities of the component tests as per the
classical definition, on a real test battery that comprises 2 component tests.

We have presented a new method of computing reliability as per the classical definition, and demonstrated
its proficiency and simplicity. Thus, it is possible to uniquely find test characteristics like reliability or error
variance from the data, and these can be adopted while reporting the results of the administered test. In this
paradigm, testing the hypothesis of equality of error variances of two tests will help to compare the tests.